2022/09/22

The Basic Problem

Hypotheses about mean differences are very common.

Typical questions are:

Does the mean

  • differ between group A and group B? (between design)
  • of group A change from time 1 to time 2 (e.g., before and after treatment)? (within design)
  • change from time 1 to time 2 differ between group A and group B? (mixed design)

Between vs Within Design

|                                         | BETWEEN | WITHIN | example/explanation |
|-----------------------------------------|---------|--------|---------------------|
| learning and transfer across conditions | no      | yes    | comparing difficulty of IQ tests |
| effort per participant                  | low     | high   | just one test, instead of multiple |
| difficulty to set up                    | low     | high   | comparing IQ tests: random assignment vs. random sequence per participant |
| random noise                            | high    | low    | potential confounding factors are held constant by using the same person |
| required sample size                    | high    | low    | participants contribute multiple data points + less noise |

Sample size example (2 groups)

# requires the pwr package; %>% and .$ come from magrittr
params <- list(sig.level = 0.05, power = 0.80, d = 0.25)
do.call(pwr::pwr.t.test, c(type='two.sample', params)) %>% .$n
## [1] 252.1275

between: 253 participants per group

do.call(pwr::pwr.t.test, c(type='paired', params)) %>% .$n
## [1] 127.5158

within: 128 pairs

When is comparing the mean appropriate?

Appropriateness of mean as measure of central tendency

  • the measure is at least interval scaled (there is order and differences are meaningful)
  • unimodal distribution in each group
  • symmetric distribution in each group (not too much skew)
  • no outliers

@unimodal

library(tibble)   # tibble()
library(ggplot2)  # ggplot(); FamilyRank provides rbinorm()
bimodal <- tibble(data=FamilyRank::rbinorm(1000, -1, 1, .4, .4, .5))
ggplot(bimodal, aes(data)) +
  geom_histogram(bins = 40) +
  geom_vline(aes(xintercept = mean(bimodal$data)),col='red',size=2)

@symmetric

gamma <- tibble(data=rgamma(1000, 2, 1))
ggplot(gamma, aes(data)) +
  geom_histogram(bins = 40) +
  geom_vline(aes(xintercept = mean(gamma$data)),col='red',size=2)

@outlier

out <- tibble(data=c(rnorm(980), rnorm(20, 50, 5)))
ggplot(out, aes(data)) +
  geom_histogram(bins = 100) +
  geom_vline(aes(xintercept = mean(out$data)),col='red',size=2)

How to test for differences in means?

  • most typical: Analysis of Variance
  • main benefit - it’s versatile:
    • multiple groups
    • multiple factors
    • between, within, and mixed design variants
    • multiple measures
    • allows covariates
  • main drawback - it relies on assumptions: if they are met the estimator is BLUE (best linear unbiased estimator), if not the test becomes liberal (the probability of rejecting H0 is too big)
    • normal distribution
    • variance homogeneity
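As a minimal base-R sketch (simulated data, illustrative names): a classic one-way between-subjects ANOVA next to Welch's `oneway.test`, which drops the variance-homogeneity assumption:

```r
# One-way between-subjects ANOVA on simulated data (base R only)
set.seed(1)
d <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 20)),
  y     = c(rnorm(20, mean = 0), rnorm(20, mean = 0.5), rnorm(20, mean = 1))
)
fit <- aov(y ~ group, data = d)
summary(fit)                       # classic ANOVA table (assumes equal variances)
oneway.test(y ~ group, data = d)   # Welch's ANOVA: no variance-homogeneity assumption
```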

The issue of testing the conditions

  • pre-tests are statistical tests as well. They have their own Type I and Type II errors and depend on n.
  • this causes a special form of a multiple-testing scenario
  • the testing is sequential: typical corrections for parallel testing, like Bonferroni or Benjamini-Hochberg, don't apply
  • consequence: unknown final Type I and Type II risks (see, e.g., Rasch et al., 2011)
  • recommendation: if there is a good, robust alternative, use it right from the start
  • at least one robust against violations of variance homogeneity, because this is a bigger issue than deviations from normality, provided the mean is an appropriate measure of central tendency (see before)
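For illustration, two common pre-tests, both ordinary significance tests whose power depends on n (a sketch on simulated data):

```r
# Typical pre-tests are themselves significance tests (base R, simulated data)
set.seed(2)
x <- rnorm(40)                 # outcome
g <- gl(2, 20)                 # two groups of 20
shapiro.test(x)$p.value        # pre-test for normality
bartlett.test(x ~ g)$p.value   # pre-test for variance homogeneity
```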

Robust Alternatives

  • parametric corrections (e.g., Welch's correction for unequal variances)
  • non-parametric:
    • rank-based tests (e.g., the Kruskal-Wallis test)
      • low power
      • only applicable if distributions are the same except for location shift
      • not available for many designs
    • permutation tests
      • exact test
      • can become computationally expensive
  • make parametric more robust with
    • trimmed means
    • bootstrapping
    • transformations

Digression: Resampling Methods

Permutation test

Base assumption: the sample groups are drawn from the same distribution (H0) → switching participants between groups shouldn't matter.

Steps:

  1. value of interest (e.g., difference of mean between groups) is measured in sample.
  2. all data is pooled
  3. all permutations that preserve the original group sizes are generated (then the test is exact; in practice often only a random sample of permutations is used)
  4. value of interest is calculated in each permutation
  5. the original value is compared to all others
  6. if it is larger than 95% of the permuted values, then the H0 is rejected (α=.05, one-sided)
  7. for a two-sided test the absolute value is compared with all absolute values
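The steps above can be sketched in base R for two groups and a difference of means (simulated data; 5000 sampled permutations, so an approximate rather than exact test):

```r
set.seed(42)
a <- rnorm(30, mean = 0.5)                    # group A
b <- rnorm(30, mean = 0.0)                    # group B
obs <- mean(a) - mean(b)                      # step 1: observed difference
pooled <- c(a, b)                             # step 2: pool all data
perm <- replicate(5000, {                     # steps 3-4: sampled permutations
  idx <- sample(length(pooled), length(a))    # reshuffled group assignment
  mean(pooled[idx]) - mean(pooled[-idx])
})
p_one_sided <- mean(perm >= obs)              # steps 5-6: compare to the rest
p_two_sided <- mean(abs(perm) >= abs(obs))    # step 7: two-sided variant
```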


Digression: Resampling Methods

Bootstrapping

Base concept: random sampling with replacement

Allows the estimation of the sampling distribution of almost any statistic, which can be used for measures of accuracy (e.g., confidence intervals)
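A minimal sketch in base R: a 95 % percentile bootstrap confidence interval for the mean of a skewed sample (all data simulated):

```r
set.seed(7)
x <- rgamma(200, shape = 2, rate = 1)                           # skewed sample
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))  # resample with replacement
ci <- quantile(boot_means, c(0.025, 0.975))                     # 95% percentile CI
```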

Digression: Trimmed Means

Base concept: use only the middle 95/90/80 % of the sample values (trimming the most extreme values from both tails)

  • applied to achieve fewer sampling fluctuations
  • rationale: extreme values are unrepresentative but strongly impact the mean
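In base R this is just the `trim` argument of `mean()`; a sketch with two artificial extreme values:

```r
set.seed(3)
x <- c(rnorm(98), 50, 60)   # 98 ordinary values plus two extreme ones
mean(x)                     # strongly pulled towards the outliers
mean(x, trim = 0.1)         # 10% trimmed mean: drops the lowest and highest 10%
```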

Digression: Transformation (Part 1/2)

Used to convert non-normally distributed data into normally distributed data.

  • most frequent transformation: log, but not necessarily the best choice
  • if there is no theoretical reason for a specific transformation, use Box-Cox Power Transformation to find the “best fitting” one
  • Box-Cox transform is used to find the optimal \(\lambda\) in \(\displaystyle \frac{y^{\lambda} - 1}{\lambda}\)
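A sketch using `MASS::boxcox` to pick λ by profile likelihood (the simulated gamma data are illustrative only):

```r
library(MASS)
set.seed(5)
y <- rgamma(500, shape = 2, rate = 1)           # positive, right-skewed data
bc <- boxcox(y ~ 1, lambda = seq(-2, 2, 0.05), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]                 # lambda with highest profile likelihood
y_t <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda  # apply transform
```

Note that at λ = 0 the transform is defined as log(y), which the last line handles separately.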

Digression: Transformation (Part 2/2)

Why you might not want to use it!

  • Box-Cox works only for positive values (solution: shift into positive range or Yeo–Johnson transformation)
  • Interpretation is hard (hardly anybody understands logs, much less arbitrary power transformations)
  • it is not a linear transformation; therefore, the back-transformation can introduce bias
  • e.g., the raw mean in the previous example was 0.507, but back-transforming the mean of the transformed values yields 0.214
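The bias is easy to demonstrate with a log transform: by Jensen's inequality, exponentiating the mean of the log-values recovers the geometric mean, which for skewed data lies below the arithmetic mean (simulated data, a sketch only):

```r
set.seed(9)
y <- rgamma(1000, shape = 2, rate = 1)
raw_mean  <- mean(y)             # arithmetic mean on the raw scale
back_mean <- exp(mean(log(y)))   # back-transformed mean of the log-values
raw_mean > back_mean             # TRUE: the back-transformation underestimates
```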

Thank you for your attention!

Next Time:

Mean differences - Practical Examples